Development and multimodal validation of a substance misuse algorithm for referral to treatment using artificial intelligence (SMART-AI): a retrospective deep learning study

Summary

Background Substance misuse is a heterogeneous and complex set of behavioural conditions that are highly prevalent in hospital settings and frequently co-occur. Few hospital-wide solutions exist to comprehensively and reliably identify these conditions to prioritise care and guide treatment. The aim of this study was to apply natural language processing (NLP) to clinical notes collected in the electronic health record (EHR) to accurately screen for substance misuse.

Methods The model was trained and developed on a reference dataset derived from a hospital-wide programme at Rush University Medical Center (RUMC), Chicago, IL, USA, that used structured diagnostic interviews to manually screen admitted patients over 27 months (between Oct 1, 2017, and Dec 31, 2019; n=54 915). The Alcohol Use Disorder Identification Test and the Drug Abuse Screening Test served as reference standards. The first 24 h of notes in the EHR were mapped to standardised medical vocabulary and fed into single-label, multilabel, and multilabel with auxiliary-task neural network models. Temporal validation of the model was done using data from the subsequent 12 months on a subset of RUMC patients (n=16 917). External validation was done using data from Loyola University Medical Center, Chicago, IL, USA, between Jan 1, 2007, and Sept 30, 2017 (n=1991 adult patients). The primary outcome was discrimination for alcohol misuse, opioid misuse, or non-opioid drug misuse. Discrimination was assessed by the area under the receiver operating characteristic curve (AUROC). Calibration slope, intercept, and the unreliability index were used to assess calibration. Bias assessments were performed across demographic subgroups.

Findings The model was trained on a cohort in which 3·5% of patients (n=1921) had misuse of any substance type. 220 (11%) of 1921 patients with substance misuse had more than one type of misuse. The multilabel convolutional neural network classifier had a mean AUROC of 0·97 (95% CI 0·96–0·98) during temporal validation for all types of substance misuse. The model was well calibrated and showed good face validity, with model features containing explicit mentions of aberrant drug-taking behaviour. Between non-Hispanic Black and non-Hispanic White groups, false-negative rates were 0·18–0·19 and false-positive rates were 0·03. In external validation, the AUROCs for alcohol and opioid misuse were 0·88 (95% CI 0·86–0·90) and 0·94 (0·92–0·95), respectively.

Interpretation We developed a novel and accurate approach that leverages the first 24 h of EHR notes to screen for multiple types of substance misuse.

Funding National Institute on Drug Abuse, National Institutes of Health.


TRIPOD checklist (D = development, V = validation; page numbers refer to the manuscript)

Introduction
Background and objectives:
3a (D;V) Explain the medical context (including whether diagnostic or prognostic) and rationale for developing or validating the multivariable prediction model, including references to existing models. (pages 8-9)
3b (D;V) Specify the objectives, including whether the study describes the development or validation of the model or both. (page 9)

Methods
Source of data:
4a (D;V) Describe the study design or source of data (e.g., randomized trial, cohort, or registry data), separately for the development and validation data sets, if applicable. (pages 10, 12)
4b (D;V) Specify the key study dates, including start of accrual; end of accrual; and, if applicable, end of follow-up. (Fig. 1)
Participants:
5a (D;V) Specify key elements of the study setting (e.g., primary care, secondary care, general population) including number and location of centres. (pages 14-15)
Outcome:
6b (D;V) Report any actions to blind assessment of the outcome to be predicted. (pages 12-13)
Predictors:
7a (D;V) Clearly define all predictors used in developing or validating the multivariable prediction model, including how and when they were measured. (pages 12-13)
7b (D;V) Report any actions to blind assessment of predictors for the outcome and other predictors. (page 15)
Sample size:
8 (D;V) Explain how the study size was arrived at. (page 14)
Missing data:
9 (D;V) Describe how missing data were handled (e.g., complete-case analysis, single imputation, multiple imputation) with details of any imputation method. (page 16, App 8)
Statistical analysis methods:
10a (D) Describe how predictors were handled in the analyses. (page 13, App 3-6)
10b (D) Specify type of model, all model-building procedures (including any predictor selection), and method for internal validation. (pages 13-14)
10c (V) For validation, describe how the predictions were calculated. (pages 14-15)

Results
Participants:
13b (D;V) Describe the characteristics of the participants (basic demographics, clinical features, available predictors), including the number of participants with missing data for predictors and outcome. (Table 1)
13c (V) For validation, show a comparison with the development data of the distribution of important variables (demographics, predictors and outcome). (App 9)
Model development:
14a (D) Specify the number of participants and outcome events in each analysis.
14b (D) If done, report the unadjusted association between each candidate predictor and outcome. (page 19, App 13)
Model specification:
15a (D) Present the full prediction model to allow predictions for individuals (i.e., all regression coefficients, and model intercept or baseline survival at a given time point).

('x' here denotes a number between 0 and 9, if present in the ICD code.)

Appendix 5. Methods for each architecture used

1) Multi-label learning
The model trains on non-mutually exclusive labels with independent outcomes for alcohol misuse, opioid misuse, and non-opioid drug misuse. We did not apply any weights to the training loss for each misuse type.
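A minimal sketch of this multi-label set-up, assuming PyTorch (the framework named later in this appendix); the batch size and random labels are illustrative only:

```python
import torch
import torch.nn as nn

# Three independent sigmoid outputs, one per misuse type, trained with an
# unweighted binary cross-entropy loss. Labels can co-occur.
logits = torch.randn(32, 3)                     # batch of 32 encounters x 3 misuse types
targets = torch.randint(0, 2, (32, 3)).float()  # alcohol, opioid, non-opioid drug labels
loss = nn.BCEWithLogitsLoss()(logits, targets)  # no per-label weights, as described above
```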

2) Multi-task multi-label learning
The model trains on the same non-mutually exclusive labels with independent outcomes for alcohol misuse, opioid misuse, and non-opioid drug misuse, along with substance use disorder (SUD)-related ICD-9 and ICD-10 codes as secondary labels [Figure]. In total, we had thirty different Elixhauser comorbidity categories and nine different SUD-related ICD code categories. We used the hcuppy Python library to derive the Elixhauser categories for each encounter. We treat these extra labels as auxiliary labels, intended to add complexity to the model and improve its learning capacity for the actual labels. We added the loss weight as a hyperparameter during model training. If Ls is the loss for the actual outcomes and La the loss for the auxiliary outputs, the total loss is given by:

Total Loss = weight * Ls + (1 - weight) * La

(a code sketch of this weighted loss appears after the logistic regression description below).

Model Experiments:

1) Logistic Regression
In logistic regression, the training dataset is fed into the model using a bag-of-CUIs approach: we create a matrix from the training dataset where every row is an encounter and every column is a unique CUI (n=37317). Each unique CUI in a training document is counted and normalized across the entire document. We experimented with different penalty values, with the inverse of regularization strength C ranging from 0.001 to 1000, and with class weight set to balanced.
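First, a minimal sketch of the weighted auxiliary loss defined above, assuming PyTorch; the 39 auxiliary labels (30 Elixhauser categories plus 9 SUD-related ICD categories) and the default weight of 0.7 are illustrative:

```python
import torch.nn as nn

# Implements Total Loss = weight * Ls + (1 - weight) * La from above.
bce = nn.BCEWithLogitsLoss()

def total_loss(primary_logits, primary_targets, aux_logits, aux_targets, weight=0.7):
    L_s = bce(primary_logits, primary_targets)  # loss on the three misuse labels
    L_a = bce(aux_logits, aux_targets)          # loss on the auxiliary ICD/Elixhauser labels
    return weight * L_s + (1 - weight) * L_a
```

And a hedged sketch of the bag-of-CUIs logistic regression experiment, assuming scikit-learn; `train_docs` (a list of space-delimited CUI strings, one per encounter) and `labels` (binary labels for one misuse type) are hypothetical placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

vectorizer = CountVectorizer(lowercase=False)  # one column per unique CUI
X = vectorizer.fit_transform(train_docs)       # count each CUI per document
X = normalize(X, norm="l1")                    # normalize counts across each document
clf = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
clf.fit(X, labels)                             # C was searched from 0.001 to 1000
```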

2) Feed-Forward Neural Network
In the feed-forward neural network, the training dataset is fed into the model using the same bag-of-CUIs approach: we create a matrix from the training dataset where every row is an encounter and every column is a unique CUI (n=37317), and each unique CUI in a training document is counted and normalized across the entire document. The matrix is passed through a fully connected (dense) layer followed by a ReLU for non-linearity. Finally, we add a sigmoid output to predict each substance misuse type. We tested the Adam, RMSprop, and Adagrad optimizers, dropout rates between 0.1 and 0.9, and learning rates from 0.01 to 0.0001.
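A minimal sketch of this architecture, assuming PyTorch; the hidden-layer size of 512 is one illustrative point from the search space, not the selected value:

```python
import torch
import torch.nn as nn

class FeedForwardBagOfCUIs(nn.Module):
    """Dense network over a normalized bag-of-CUIs vector (37317 features)."""
    def __init__(self, n_cuis=37317, hidden=512, n_labels=3, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_cuis, hidden),   # fully connected (dense) layer
            nn.ReLU(),                   # non-linearity
            nn.Dropout(dropout),         # dropout searched between 0.1 and 0.9
            nn.Linear(hidden, n_labels), # one logit per misuse type
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))  # independent probability per substance type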

3) Deep Averaging Neural Network
In the deep averaging neural network, each training document is limited to a maximum of 12000 words/CUIs. First, we create an embedding layer of dimension 300, which we average across the sequence before sending the information to a dense layer, followed by a ReLU for non-linearity. In the final layer, we add a sigmoid output for each substance misuse type. We experimented with the Adam and RMSprop optimizers, dropout rates from 0.1 to 0.9, and learning rates from 0.01 to 0.0001. This network had relatively fewer parameters to learn than the feed-forward neural network.
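A minimal sketch of the deep averaging network under the same assumptions; the vocabulary size, padding index, and hidden size are illustrative:

```python
import torch
import torch.nn as nn

class DeepAveragingNetwork(nn.Module):
    """Embed each CUI, average the embeddings, then classify."""
    def __init__(self, vocab_size=37318, embed_dim=300, hidden=256, n_labels=3, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(embed_dim, hidden)
        self.out = nn.Linear(hidden, n_labels)

    def forward(self, token_ids):          # (batch, up to 12000 CUI ids)
        emb = self.embedding(token_ids)    # (batch, seq_len, 300)
        avg = emb.mean(dim=1)              # average across the sequence
        # (a full implementation would mask padding positions before averaging)
        h = torch.relu(self.fc(self.dropout(avg)))
        return torch.sigmoid(self.out(h))  # one probability per misuse type
```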

4) Convolutional Neural Network
In the convolutional neural network, each training document is also limited to a maximum of 12000 words/CUIs. First, we create an embedding layer of dimension 300, followed by a CNN layer with different filter sizes. The learnable weights in this layer are shared across positions; hence the network can extract features from the embedding layer even with a shallow architecture. The extracted features are sent through a max-pooling layer, followed by a fully connected layer, in which we experimented with units ranging from 8 to 2048. Again, in the final layer, we add a sigmoid output for each substance misuse type. We tested the Adam optimizer, dropout rates from 0.1 to 0.9, and learning rates from 0.01 to 0.0001.
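A minimal sketch of this text CNN, assuming PyTorch; the filter sizes, filter count, and dense-unit count are illustrative points from the search space:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Embedding -> parallel convolutions -> max-pooling -> dense -> sigmoid."""
    def __init__(self, vocab_size=37318, embed_dim=300, n_filters=128,
                 filter_sizes=(3, 4, 5), dense_units=256, n_labels=3, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, n_filters, k) for k in filter_sizes]
        )
        self.fc = nn.Linear(n_filters * len(filter_sizes), dense_units)
        self.out = nn.Linear(dense_units, n_labels)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids):                        # (batch, seq_len <= 12000)
        emb = self.embedding(token_ids).transpose(1, 2)  # (batch, 300, seq_len)
        # convolution weights are shared across positions; max-pooling keeps
        # the strongest response per filter
        pooled = [conv(emb).relu().max(dim=2).values for conv in self.convs]
        h = torch.relu(self.fc(self.dropout(torch.cat(pooled, dim=1))))
        return torch.sigmoid(self.out(h))                # one probability per misuse type
```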

5) Transformer-based Neural Network
Transformer-based neural networks are the architecture behind many recent NLP breakthroughs. Transformer models use an attention mechanism that provides context for each input sequence. Because encounters often have input sequences longer than 500 CUIs, we avoided positional embeddings. The maximum input sequence length was 6000, and we experimented with multiple attention heads and layers. The Adam optimizer was used, with learning rates ranging from 0.1 to 0.0001. We split the training dataset into a 90% training set (n=49423) and a 10% development set (n=5492). We trained using the training set and performed model selection using the development set. We used a random search approach to tune the hyperparameters for each of the models. In random search, hyperparameters are selected randomly from a pool of hyperparameter space, and the process repeats until we find the hyperparameters that give the highest area under the precision-recall curve (PR AUC). Unlike grid search, random search does not try every combination of parameters from the hyperparameter space.
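A minimal sketch of such an encoder in current PyTorch (the study used PyTorch 1.4; `batch_first` assumes a newer version, and the d_model, head, and layer counts are placeholders from the searched space):

```python
import torch
import torch.nn as nn

class TransformerClassifier(nn.Module):
    """Transformer encoder over CUI embeddings, without positional embeddings."""
    def __init__(self, vocab_size=37318, d_model=300, n_head=6,
                 n_layers=2, n_labels=3, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_head,
                                           dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, n_labels)

    def forward(self, token_ids):                    # (batch, seq_len <= 6000)
        h = self.encoder(self.embedding(token_ids))  # no positional embeddings added
        return torch.sigmoid(self.out(h.mean(dim=1)))  # mean-pool, then classify
```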

Compared with grid search, the parameter search time is therefore quicker while still yielding strong results. We ran eight different experiments using four Tesla V100 GPUs, Python 3.6, and PyTorch 1.4. For each experiment, we ran a random search until we found the best precision-recall area under the curve.

[Table of searched hyperparameters and number of parameters for each model not reproduced here.] Abbreviations: BOW = bag of words, DAN = deep averaging network, CNN = convolutional neural network, d_model = number of expected features in the encoder/decoder input, d_k = dimension of keys, d_v = dimension of values, n_head = number of heads in the multi-head attention layer, max length = maximum length of the document, dense layer = number of neurons.

Appendix 7. eXplainable Artificial Intelligence (XAI) with Local Interpretable Model-agnostic Explanations (LIME)

We applied the LIME package to understand the highest-weighted features that discriminate between cases and non-cases. LIME interprets a black-box model such as a neural network through local approximation: it fits a linear model around each prediction and assigns a weight to each feature. For hyperparameter selection, a grid search was applied to a small training dataset (30 documents), choosing the configuration with the best average R² value. The primary hyperparameters we experimented with were feature selection ("forward selection", "auto", "lasso_path", and "none") and kernel width (ranging from 1 to 7). Then, we ran LIME on a 2000-document subset of the entire training dataset, keeping the prevalence of substance misuse the same as that of the whole cohort.
The weights for all features (n=37371) in each document were averaged and sorted to produce the top 25 weighted features. We then repeated the experiment for each misuse type. For each misuse type, the best hyperparameters selected were "auto" for feature selection and a kernel width of 2.
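For illustration, a hedged sketch of this LIME call pattern; `document` (one encounter's CUIs as a space-delimited string) and `predict_proba` (a wrapper that maps a list of such strings to an array of per-class probabilities from the trained model) are hypothetical placeholders:

```python
from lime.lime_text import LimeTextExplainer

# Hyperparameters selected above: feature_selection="auto", kernel_width=2.
explainer = LimeTextExplainer(kernel_width=2, feature_selection="auto",
                              class_names=["alcohol", "opioid", "non-opioid drug"])

explanation = explainer.explain_instance(
    document,          # hypothetical: one encounter's CUIs, space-delimited
    predict_proba,     # hypothetical: returns an (n_samples, n_classes) array
    num_features=25,   # top 25 weighted features, as reported above
)
print(explanation.as_list())
```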